ABSTRACT

The paper investigates and analyses the current state
of the art of criterion-referenced measurement (CRM), with a view to
determining its use in training and instructional programs. It
presents a review of the literature pertaining to the following
aspects: a brief history of CRM; a definition and comparison of
criterion-referenced and norm-referenced measures; usage of the two
measures; and the construction and evaluation of criterion-referenced
tests in terms of validity, reliability, and other test
characteristics. The literature supports the following conclusions:
(1) all definitions of CRM stress score interpretation as
representing what the individual can do relative to instructional
objectives rather than other individuals; (2) criterion-referenced
information is valuable in making certain decisions based on what a
person can do at a given point in the training cycle; (3) CRM has
focused much attention on behavioral objectives and training
outcomes; (4) behavioral objectives must be carefully written to
effectively direct and measure instruction; (5) more than one measure
should be used to validate any CRM to decrease the error associated
with its measurement; (6) it is difficult to develop objective
procedures necessary for CRM of complex behavior; (7) CRM supplements
but should not replace normative tests in training; and (8) more
research is needed before extensive use of CRM in instructional
programs can be recommended. (Author/NJ)
Chapter I
INTRODUCTION 

A great deal of work has been done in training and
educational evaluation and measurement since E. L. Thorndike's
(1918) declaration of faith, "Anything that exists at all
exists in some quantity, and anything that exists in some
quantity is capable of being measured." (p. 16)

The concept of criterion-referenced measurement
(CRM) has received a great deal of attention recently in
training, educational, and measurement literature. Trow
(1961) and others have suggested that it may mark the
beginning of a new era in measurement. The recent emphasis
on CRM has been due to the concern about the measurement
of proficiency or competency of occupational and educational
tasks.

Glaser (1963), a pioneer in CRM, stated:

. . . many of us are beginning to recognize that the
problems of assessing existing levels of competence and
achievement, and the conditions that produce them
require some additional consideration. (p. 531)

Glaser (1963) has suggested that what is needed in
measuring competency is:

. . . explicit information as to what the individual
can or cannot do. Criterion-referenced measures indicate
the content of the behavioral repertory, and the
correspondence between what an individual does and the
underlying continuum of achievement. Measures which
assess student achievement in terms of a criterion
standard thus provide information as to the degree of
competence attained by a particular student which is
independent of reference to the performance of others.
(p. 520)

A main issue in the CRM movement is the distinction
between norm-referenced and criterion-referenced approaches
to measurement. Norm-referenced measurement (NRM) identifies
an individual's test performance in relation to the performance
of others on the same measure. CRM identifies an individual's
performance with respect to specified performance standards.

Jackson (1970) has pointed out that it has become
increasingly clear that measurement by norm-referenced tests
does not provide the information that is needed in making
certain kinds of decisions about instructional programs.
Cronbach and Gleser (1965) have questioned the usefulness
of classical test theory and NRM for all testing situations.
Popham and Husek (1969) concluded that:

. . . the problem is now not only how to summarize
a student's performance on a test, but also how to
insure that a test is constructed (and judged) in a
manner appropriate for its use, even if its use is not
in the classical framework. (p. 1)

Although most of the literature on CRM has come from
educational sources, its use has been advocated for industrial,
military, business, and governmental training programs and
promotions. (Fremer, 1972; Garvin, 1971; Goldstein, 1974;
Swezey, Pearlstein, and Ton, 1974; Thornton and Wasdyke,
1972) Goldstein (1974) has pointed out that:

The norm-referenced measures tell us that one student
is more proficient than another, but they do not provide
much information about the degree of proficiency in
relationship to the tasks involved. Unfortunately,
many training evaluations have employed norm-referenced
measures to the exclusion of other forms of measurement.
In order to properly evaluate training programs, it is
necessary to obtain criterion-referenced measures that
provide information about the skill level of the trainee
in relationship to the expected program achievement
levels. (pp. 63-64)

Measurement specialists (Cronbach, 1963; Ebel, 1962;
Hambleton and Novick, 1973; Livingston, 1972; and Millman,
1974) have indicated that there is a pressing need to
develop achievement or performance measurement theory.
Glaser and Nitko (1971) have asserted that:

Tests that measure instructional outcomes and that
are used for making instructional decisions demand
special characteristics--characteristics that are
different from the mental test model that has been
successfully applied in aptitude testing work. (p. 652)

Purpose and Scope of This Paper 

The purpose of this paper was to investigate and
analyze the current state-of-the-art of CRM to determine
the feasibility of using it in training and instructional
programs.

The following questions were posed by the writer in 
an attempt to analyze CRM: 



1. What is criterion-referenced measurement?

2. What are the differences and similarities
between criterion-referenced and norm-referenced
measures?

3. When and how should criterion-referenced
measurement be used?

4. How is a criterion-referenced test constructed?

5. How can a criterion-referenced test be evaluated
in terms of validity, reliability, discrimination,
and other test characteristics?

Throughout this paper, the terms training and education
have been used interchangeably. It is the belief of the author,
and that of others, that education and training deal with
the same instructional processes of acquiring skills,
knowledges, and attitudes in order for an individual to
perform in another environment. As Goldstein (1974) has
pointed out, both of the disciplines deal with similar areas,
such as specification of objectives, environmental design,
and evaluation.

An attempt has been made to synthesize the writings of
those in education and those in other fields. However,
because most of the literature has come from education,
this integration was difficult.

A review of the literature in Chapter II pertains to
the following aspects: a brief history of CRM; defining the
term criterion-referenced measurement; a comparison between
CRM and NRM; usage of CRM and NRM; writing a
criterion-referenced test; and empirical and logical evaluation
of criterion-referenced tests.

The summary and conclusions are included in Chapter
III.




Chapter II 

USING CRITERION-REFERENCED MEASUREMENT IN TRAINING

Since 1963, the area of CRM has been a hot topic,
with hundreds of articles and books being written about its
theoretical basis, development, use, and test parameters.
This section attempted to analyze and synthesize the
current state-of-the-art of CRM and its role in training.

Evaluation and Measurement in Training 

In the instructional process, learning has been
defined as:

. . . the process by which behavior is initiated or
changed as a result of experience . . . through training
and practice. (Garry, 1963, p. 2)

The particular aspects of behavior acquired by an
individual depend upon how the training environment is
designed and developed. What is taught and how it is taught
depend upon the objectives and values of the organization.
(Lynton and Pareek, 1967)

Many facets of human behavior are involved in the
instructional process: the learning of the subject-matter
content and skills; and the processes involved in using them,
such as critical thinking, retention, transfer, problem
solving, and creating. The attitudes and motivation toward
these activities are also forms of behavior. The total design
of a training environment is a complex enterprise, and there
are many variables which foster, nurture, guide, influence,
and control human behavior within its structure. (Lynton
and Pareek, 1967)

Evaluation and measurement play an important role in
the instructional process. It should be noted, however, that
the terms "evaluation" and "measurement" have distinctive
meanings. Measurement is concerned with the application of
an instrument or instruments to collect data for some specific
purpose. (Green, 1970) In other words, measurement refers
to quantitative descriptions of behavior, things, or events.
(Gronlund, 1968)

Evaluation is a broader concept than measurement in
that it involves not only quantitative descriptions, but
also qualitative descriptions. Gronlund (1968) wrote:

In addition to such numerical and verbal descriptions,
evaluation includes value judgments concerning the thing
described. Thus, when we evaluate the achievement of a
student, the effectiveness of instruction, or the
appropriateness of a curriculum, we are concerned with
judging their value or worth. (Gronlund, 1968)

Evaluation is a comprehensive and complex process.
The procedural steps, as described by Gronlund (1968), include:

. . . (1) identifying the objectives (i.e., the
desired outcomes), (2) defining the objectives in
behavioral terms (i.e., specifying the behavior we are
willing to accept as evidence of the desired learning),
(3) selecting, or constructing, instruments for measuring
(or describing) the behavior, and (4) applying the
instruments and analyzing the results to determine the
degree to which the desired learning outcomes have been
achieved.

The fundamental task of measurement is to provide
information for making basic, essential decisions with
respect to the instructional design and operation. (Nelson,
1970) According to Glaser and Nitko (1971), four activities
of instructional design determine measurement requirements.
These are:

. . . the analysis of the subject-matter domain
under consideration, diagnosis of the characteristics of
the learner, design of the instructional environment, and
the evaluation of the learning outcomes. (pp. 625-626)

In the analysis of the subject matter, experts
analyze the subject-matter domain in terms of performance
competencies. The characteristics of the domain are
constructed according to conceptual hierarchies and operating
rules in terms of increasing complexity of human performance.
The analysis and definition of instructionally relevant
performance is of major concern. This can be accomplished
through the specification of behavioral objectives,
translating them into types of observable performance, and
conducting research studies about different instructional
methodologies. (Glaser and Nitko, 1971)

Diagnosing the characteristics of the trainee involves
the measurement of the behavior an individual has upon entering
a program. In other words, these measurements provide
information about existing pre-instructional behavior. This is
helpful in starting the instruction based on what the trainee
already knows and can do. (Goldstein, 1974; Millman, 1972;
Mirsberger, 1974)

The third activity is that of designing the instructional
environment and specifying the conditions under which
learning can take place. This allows the individual to
progress toward the training goals described as
subject-matter competence and acquire the desired outcomes of
instruction. (Glaser and Nitko, 1971)

The final activity of evaluation is measuring learning
outcomes. This provides information about the extent to
which the instructional objectives have been attained and
the extent to which the behavior of the trainee approaches
the performance criteria. The trainee is said to have
mastery of the instructional objectives when the degree of
performance has been attained as specified by the designer
of the instructional program. (Glaser and Nitko, 1971)

Mirsberger (1974) stated that:

Evaluation, in the view of the trainee-oriented
instructor, is the process of obtaining feedback which
is then used to direct the remaining portion of the
training program. (p. 34)

Mirsberger's phases are similar to Glaser and Nitko's
(1971) stages, but he adds an on-the-job performance phase.
Mirsberger's phases include:

1. Pretraining phase: that evaluation done before
any actual training is started.

2. Training phase: evaluation made throughout the
learning period.

3. Posttraining phase: the evaluation made at the
end of the training effort.

4. Performance phase: the evaluation of the matriculated
trainee in an on-the-job situation after the
training effort. (p. 34)

In summary, learning in a training environment is a
process of changing the behavior of an individual from an
initial entering state to a specified terminal state.
Instruction is the practice of providing conditions and
activities for this transaction to occur. Evaluation, of
which measurement is a part, is the collecting of data,
assessments, and information about the instructional program
and the trainee's performance. It is used to make basic
decisions in developing the overall effectiveness of the
training system.

Historical Perspective of Criterion-Referenced Measurement

The psychological testing movement started with the
Darwinian emphasis on differences between individuals, and
the theoretical framework of test scores was developed to
emphasize differences in abilities and traits. (Mehrens and
Lehmann, 1969) Psychological testing has concentrated on
comparative interpretations. What the mental test measures
is whatever causes some people to get high scores, and others
to get low scores. The psychologist is likely to say that the
test measures nothing if everyone scores the same, except for
variation due to errors of observation. (Cronbach, 1971)

With the development of psychological tests around
the turn of the Twentieth Century by Galton, Cattell, Binet,
Goddard, Terman, Otis, and others, a new era in measurement
was born. The mental test (a term coined by Cattell in
1890), although developed to discover and predict aptitude,
was introduced in the schools to measure achievement for
diagnostic and training purposes. (Trow, 1971)

Achievement testing is different from aptitude
testing in that:

An achievement test is used to measure an individual's
present level of knowledge or skills or performance, an
aptitude test is used to predict how well an individual
may learn. (Mehrens and Lehmann, 1969, p. 73)

After World War I, there was a boom in standardized
subject-matter tests, statistics and measurement courses, and
textbooks related to these fields. (Horrocks and Schoonover,
1968)

Although the mental test was devised to differentiate
and compare individuals for recommending further treatment,
training, or education, the procedures of assigning school
marks got mixed up in their use. Because the system of
assigning grades was based on the probability curve, the mark
a student received was based on what others did on the same
test, not on what level of knowledge, understanding, or skill
proficiency the individual pupil had achieved. (Trow, 1971,
p. ix)

Recently, there has been a revival of interest in
absolute measurement, now retitled criterion-referenced
measurement. (Ebel, 1962; Glaser, 1963; Popham and Husek,
1969; Tyler, 1966) CRM has been around in this country since
the early part of the Twentieth Century, with scales developed
by Courtis, Thorndike, Ayres, and others for measuring
handwriting, composition, arithmetic, and other subjects. (Trow,
1971) During the period from 1909 to 1915, a series of
arithmetic tests and five scales for measuring abilities
in English composition, spelling, drawing, and handwriting
were published. (Odell, 1930)

In 1909, Thorndike published a standardized achievement
scale, The Scale for Handwriting of Children. The
introduction of standard measures of achievement is most
often attributed to E. L. Thorndike, whose students were later
to make great contributions to the field of measurement and
achievement testing. (Horrocks and Schoonover, 1968)

Ayres' handwriting scale was devised by judges who
studied and arranged different specimens of pupil handwriting
according to quality. Suitably spaced specimens were selected
to represent different levels of proficiency and these were
reproduced as a guide for teachers. A teacher could simply
look at successive Ayres scores on a pupil's cumulative
record and judge how the pupil's handwriting was progressing.
(Cronbach, 1971)

Ebel (1965) has pointed out that the percentage
mastery grades which were once widely favored in schools in
the early 1900's represent a crude type of criterion
measurement, although one that was generally unsatisfactory
in practice.

In 1913, Thorndike noted the limitations of NRM and
grades since they did not indicate the mastery, amount, or
type of skills and knowledges possessed by the student.
Thorndike (1913), in discussing the assigning of school
grades based on normative data, stated:

. . . the vices of the old system . . . was its relativity
and indefiniteness--the fact already described that a
given mark did not mean any defined amount of knowledge,
or power, or skill--so that it was bound to be used for
relative achievement only.

The proper remedy is not to eliminate all stimulus
to rivalry, and along with it a large part of the stimulus
to achievement in general, but to redirect the rivalry
into tendencies to go higher on an objective scale for
absolute achievement, to surpass one's own past performance,
to get into what, in athletic parlance, is called a
'higher class,' to compete within that class, and to
compete cooperatively as one of a group in rivalry with
another group. (pp. 287-288)

Nevertheless, the old NRM system which Thorndike
referred to is the one that is still used today by the
majority of evaluation experts. (Trow, 1971)



Defining Criterion-Referenced Measurement 

Glaser has been credited with having introduced the
current-day definition of CRM. (Jackson, 1970) In one of
Glaser's more recent writings on the subject, the following
definition was suggested:

A criterion-referenced test is one that is deliberately
constructed to yield measurements that are directly
interpretable in terms of specified performance standards.
(Glaser and Nitko, 1971, p. 653)

Glaser (1963) stated that criterion-referenced tests
can be differentiated from norm-referenced tests in that they
do not focus on the problem of individual differences. Rather,
they are aimed at indicating what an individual can do and
cannot do.

Although Glaser's definition is the classical one
used by most people, it is not the only one. Popham and
Husek (1969) have proposed:

Criterion-referenced measures are those which are
used to ascertain an individual's status with respect to
some criterion, i.e., performance standard. It is
because the individual is compared with some established
criterion, rather than other individuals, that these
measures are described as criterion-referenced. The
meaningfulness of an individual's score is not dependent
on comparison with testees. We want to know what the
individual can do, not how he stands in comparison with
others. (p. 2)

Ebel (1971) characterized CRM in terms of score
distribution and interpretation:

The essential difference between norm-referenced and
criterion-referenced measurements is in the quantitative
scales used to express how much the individual can do.
In norm-referenced measurement the scale is usually
anchored in the middle, on some average level of
performance for a particular group of individuals. The units on
the scale are usually a function of the distribution of
performances above and below the average level. In
criterion-referenced measurement the scale is usually
anchored at the extremities, a score at the top of the
scale indicating complete or perfect mastery of some
defined abilities; one at the bottom indicating complete
absence of these abilities. The scale units consist of
subdivisions of these total score ranges. (p. 282)
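
Ebel's two kinds of scale can be illustrated with a brief Python sketch. It is only one possible reading of the passage: the standard score (z-score) stands in for a scale "anchored in the middle," and the proportion of a defined domain answered correctly stands in for a scale "anchored at the extremities." The function names and the sample scores are hypothetical and are not taken from the literature reviewed.

```python
from statistics import mean, pstdev


def norm_referenced_score(raw, group_raw_scores):
    """Standard score: anchored at the group mean, with the group's
    standard deviation as the unit (an assumed stand-in for Ebel's
    norm-referenced scale)."""
    mu = mean(group_raw_scores)
    sigma = pstdev(group_raw_scores)
    if sigma == 0:
        # With no variability in the group, the scale has no units.
        raise ValueError("reference group has no variability")
    return (raw - mu) / sigma


def criterion_referenced_score(raw, max_raw):
    """Proportion of the defined domain mastered: 0.0 means complete
    absence of the abilities, 1.0 means complete mastery."""
    return raw / max_raw


# Hypothetical raw scores of a reference group on a 20-item test.
group = [12, 14, 15, 15, 16, 18]

print(norm_referenced_score(18, group))    # standing relative to the group
print(criterion_referenced_score(18, 20))  # share of the domain mastered
```

The same raw score of 18 yields two different statements: the first depends entirely on who else took the test, while the second depends only on the defined domain of 20 items.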

Wang (1969) has expressed that a criterion-referenced
test ". . . is an achievement test developed to assess the
presence or absence of a specified criterion behavior
described in an instructional objective." (p. 14)

It is interesting to note that these various
definitions agree in that they emphasize the direct interpretability
of scores, but differ in the extent to which they make
reference to the method by which the test is constructed. Ebel
emphasized the scale from which interpretations are to be
made, while Glaser stressed the construction.

Most writers stress the method of construction, such
as Jackson (1970) who wrote:

. . . the term 'criterion-referenced' will be used
here to apply to a test designed and constructed in a
manner that defines explicit rules linking patterns of
test performance to behavioral referents. (p. 3)

The preceding concepts are somewhat different from
one other prevalent use of the term criterion-referenced
in psychometric literature. That principle involves
correlating the scores of an achievement measuring instrument (X)
with a second measurement situation (Y), such as another test
or grade average. The Y score would be referred to as a
criterion score and the degree of relationship is expressed
by the product-moment correlation. (Tuckerman, 1972)
Criterion-related validity is similar to this concept in that
it is a technique for showing the relationship between test
scores and an independent external measure, such as a
standardized test. (Karmel, 1970)
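
The product-moment correlation between test scores (X) and criterion scores (Y) referred to above can be computed as in the following Python sketch. The function name and the sample data are hypothetical; the formula is the standard Pearson coefficient, the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations.

```python
from math import sqrt


def product_moment_correlation(x, y):
    """Pearson product-moment correlation between paired scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists of at least two scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: achievement test scores (X) and grade averages (Y).
test_scores = [10, 12, 15, 18, 20]
grade_averages = [2.0, 2.5, 2.8, 3.4, 3.9]

r = product_moment_correlation(test_scores, grade_averages)
print(r)
```

Note that the coefficient is undefined when either set of scores has no variability, which is one reason this correlational use of the word "criterion" sits uneasily with tests on which most trainees are expected to reach mastery.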

Norm-Referenced versus Criterion-Referenced Measurement 

The heart of the issue concerning CRM and NRM is
deriving meaning from the test score. The score received
by an individual on any type of test is basically inert
and must be related semantically to the behavior of the
individual. (Lord and Novick, 1968) Ebel (1962) stressed
that:

No test score, raw or standard, has much meaning as
an abstract number. Additional data for interpretation
must always be provided, either by the test producer or
by the test user from his own knowledge and experience.
The numbers which report standard scores are no more
intrinsically meaningful, and no more self-interpreting,
than raw scores. (p. 16)



For the most part, measurement specialists have
concentrated on interpreting the test scores primarily
based on the scores of others. At present, the use of NRM
is almost universal in the United States. (Ebel, 1962)

Advocates of CRM are trying to operationally define
standards upon which interpretations can be made directly
from the score. These experts believe that norm-referenced
interpretations have serious limitations ". . . when they
are employed with achievement tests that are used in
instructional systems seeking to be adaptive to the individual."
(Glaser and Nitko, 1971, p. 653)

According to Glaser and Nitko (1971), NRM has been
so dominant in training and education because of the:

. . . concentration of psychological test theory on
trait variability and on the relative difference between
individuals, the reluctance of educators to specify
precisely their goals in terms of observable behavior,
the reliance of measurement specialists on the mental
test model, and the desire of test constructors to
build tests that are applicable to many different
instructional systems. (p. 657)

As Popham and Husek (1969) have observed, it is
impossible to tell a norm-referenced test from a
criterion-referenced test by just looking at it. The difference is
found by examining the purpose of the test, the manner in
which it was constructed, the specificity of the information
obtained about the domain of instructionally relevant tasks,
the generalizability of the test performance, and the use of
the scores.

Arguments have been made that any achievement test
defines a criterion because it is representative of desired
outcomes, and that one can determine the particular tasks
an individual can perform by just examining the responses on
the person's test. Jackson (1970) wrote:

Any test samples the content of some specified domain.
Even though a test may be normed so that an individual's
score may be compared with scores of some specified group,
there is the assumption of some latent trait upon which
observed scores depend, and which the test is, therefore,
said to measure. Hence, there is always an implicit
behavioral element, and even tests that are described as
norm-referenced are designed to yield inferences about,
say, the amount of trait X that an individual has. In
contrast to a criterion-referenced test, however, the
inference is of the form--more (or less) of trait X than
the mean amount in population Y--rather than some specified
amount that is meaningful in isolation. (p. 2)

However, Glaser has argued that the way a normative
test is constructed and designed negates its use as a true
criterion based on performance standards. In practice,
desired outcomes have seldom been specified in performance
terms prior to constructing a norm-referenced test. (Glaser
and Nitko, 1971) When using a NRM, questions that appear on
the final criterion test have been revised and arranged to
maximize the test constructor's concept of what the
distribution of final scores should be and how the items should
function statistically. (Cox, 1971)

Other determinants of test construction have been
ease of administration and scoring. Lindquist (1968) has
indicated that many valuable instructionally relevant tasks
are not being tested because of computer-scoring restrictions.
All of these practices tend to distort the results of a
person's score with respect to a clearly defined domain
of tasks and performance standards. (Glaser and Nitko,
1971)

With respect to specificity of the information
obtained by CRM about the domain of tasks, there should be
a logical transition from the domain to the test and vice
versa. There should be little difficulty in identifying the
class of tasks that can be performed. Thus, all tasks in
the domain must be defined in observable behavior. (Thornton
and Wasdyke, 1972)

The attainment of certain abilities, skills, and
knowledges can only be inferred based on observable
performance. In an occupational area, the specified domain of
tasks would be analyzed and broken down into observable
performance measurements. Criterion-referenced tests do not
seek to indicate how much ability a student possesses along a
hypothetical ability dimension, but whether certain kinds of
tasks can be demonstrated. This implies an analysis of task
structure in which each task description includes criteria
of performance. In turn, a scoring system must be devised
that will preserve information about the tasks that an
individual can perform. (Fremer, 1972) Norm-referenced scores
such as percentile ranks, t-scores, and grade equivalents
lose the specificity of criterion information. (Ebel, 1962)
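
The loss of criterion information can be made concrete with a small Python sketch. It assumes a hypothetical three-task domain, a hypothetical standard of 8 items correct out of 10 on each task, and a common mid-point definition of percentile rank; none of these details comes from the sources cited. Two trainees with the same total score receive the same percentile rank, yet a task-by-task scoring system shows that they can perform different tasks.

```python
def task_profile(items_correct, standards):
    """Per-task pass/fail record: True where the trainee meets the
    performance standard set for that task."""
    return {task: items_correct[task] >= standards[task] for task in standards}


def percentile_rank(score, group_scores):
    """Percentage of the group scoring below the given score, counting
    half of those with the same score (one common definition)."""
    below = sum(s < score for s in group_scores)
    equal = sum(s == score for s in group_scores)
    return 100.0 * (below + 0.5 * equal) / len(group_scores)


# Hypothetical standards and results (items correct out of 10 per task).
standards = {"wiring": 8, "soldering": 8, "testing": 8}
trainee_a = {"wiring": 10, "soldering": 4, "testing": 9}
trainee_b = {"wiring": 6, "soldering": 10, "testing": 7}

# Hypothetical total scores of the whole group, both trainees included.
group_totals = [15, 19, 23, 23, 27]

for name, result in (("A", trainee_a), ("B", trainee_b)):
    total = sum(result.values())
    print(name, percentile_rank(total, group_totals), task_profile(result, standards))
```

Both trainees total 23 items correct, so a norm-referenced report cannot tell them apart, while the profile records which tasks each can and cannot perform.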

There must be generalizability of test performance
to the total task domain. As the trainee progresses in a
program, the number of tasks becomes very large. The
criterion-referenced test constructor must determine how long to make a
test so that generalization can be made about which specific
tasks a learner can perform. The norm-referenced test
constructor does not have this problem since wide selection
of items will result in variable scores so that it can be
said that individual X can do more or has achieved more than
individual Y. However, what individual X can actually do is
really not known. An individual's item responses provide
only a weak basis for inference when norm-referenced tests
are used.

Table 1 shows key features of CRM and NRM, as
interpreted by Boehm (1973).





                    Norm-Referenced                         Criterion-Referenced

1. General Purpose  To make comparisons among individuals   To determine how an individual
                                                            functions relative to a criterion

                    To make decisions about placement in    To program specifically for the
                    programs in which only limited          individual
                    numbers of individuals can be
                    accepted

                    To determine for whom a program         To determine whether an instructional
                    "works"                                 program "works" in developing
                                                            criterion behaviors

2. Item Types       Items must discriminate among           Items must correspond to criterion
                    individuals                             levels

                    Items all subjects pass or all fail     Items must provide explicit
                    are eliminated                          information about what an individual
                                                            can or cannot do

3. Content          Content may or may not match            Content must match classroom
                    particular classroom goals              objectives which have been
                                                            behaviorally defined beforehand

                    Sampling is made from the larger        Criterion levels can be set at each
                    task domain                             content level of a program and must
                                                            specify minimal levels of competence

4. Scores           Variability among scores is essential   Variability is irrelevant

                    Scores can mask what an individual      Scores must reflect (not mask) what
                    can do but provide indication of        an individual can or cannot do
                    his relative standing

5. Type of Ranking  Use of age and grade norms,             Percentage passing a criterion level
                    percentiles, standard scores            Pass/fail information on each item

Table 1. Characteristics of Norm-Referenced and
Criterion-Referenced Tests (Boehm, 1973)

Uses for Criterion-Referenced Measurement


An important consideration in deciding which type of 
measuremeut to use is the use of the scores. Although both 

e 

CRM and NRM provide data for decision making a.bout individuals 
aad treatments, the context with which 'decisions are made 
determi^'e which to use. 

NRM should be used if there is some degree of
selectivity necessary, such as a limitation to the number of
people that can be admitted to a training program. (Popham
and Husek, 1969)

CRM should be used to make decisions about individuals
and treatments in other situations. A criterion measure
could be used to determine whether a person has mastered
certain skills considered a prerequisite to starting a new
training program. A criterion measure reflecting a set of
instructional objectives could be used to evaluate two
different instructional sequences to determine which is
more effective. If the competencies possessed by an individual
must be known before instruction can be provided, CRM should
be used. (Popham and Husek, 1969)

Other suggestions have been made for using CRM.
Coulson and Cogswell (1965) discussed the need for it in
regard to the use of programmed materials. Glaser and Cox
(1968) suggested its use in individualized instructional
models where evaluation instruments must differentiate
between groups of pupils who have mastered certain units
and those who have not. Jackson (1970) concluded that CRM
would be desirable in the areas of diagnostic information,
formative evaluation of training programs, and the evaluative
assessment of individual and group achievement. Fremer (1972)
suggested that CRM is meaningful in relating performance to
significant real-life criteria such as minimal competency in
a basic skills area, such as math for an accountant. Thornton
and Wasdyke (1972) advocated its use in performance-based
evaluation for job promotions and certification, such as in
"The New York City Police Study for Promotions" and the
National Teacher Examination in Industrial Arts.

Garvin (1971) has suggested that different levels
of proficiency standards be established for certain
occupational tasks. If certain tasks, by their very nature,
must be performed at a specifiably high level, then an absolute
criterion level should be established and met by all. For
example, landing an aircraft or compounding a prescription
must be done correctly or public safety would be endangered.
However, there are other tasks where some latitude of
competence is permissible, such as running a lathe, selling a
product, and typing. Different levels of proficiency could
be established for these relative tasks.

Garvin (1971) further set forth some general
principles regarding the applicability of CRM to various content
areas and levels:

1. Unless at least one of the instructional objectives
of a unit envisions a task that must subsequently
be performed at a specified level of competence in at
least some situation, criterion-referenced measurement
is irrelevant because there is no criterion. In this
sense the entire sequence of 'social studies' provides
no meaningful criterion except, possibly, the entry
level for certain 'honor' courses.

2. If public safety, economic responsibility, or
other ethical considerations demand that certain tasks
be performed only by those 'qualified' for them by formal
instruction, then CRM of the outcomes of such instruction
is clearly indicated. The criterion here is the licensing
standards of the profession involved. All professional
instruction in the medical arts, law, finance, engineering,
and the applied physical and social sciences generally
is clearly in this category. Teaching -- at any level --
ought to be. However, entry to such professional training
is typically based on NRM since training capacity imposes
a 'quota.'

3. In any instructional sequence where the content
is inherently cumulative and the rigor progressively
greater, CRM should be used to control entry to successive
units. However, if there are several different sequences
differing widely in rigor, NRM is more useful in making
appropriate placements.

4. There are certain content areas to which criteria
do apply, but not everyone need meet them. These are the
'required subjects'; everyone must try to learn them -- if
only as a matter of public policy -- but it is almost
preordained that some of them will not. Home economics and
physical education are relatively non-controversial
examples at the secondary level; at the college level,
these become professions and CRM applies. (pp. 62-63)

Most test experts stress, however, that both criterion-
referenced and norm-referenced measures are needed to make
valid and enlightened decisions about individuals and programs.
(Simon, 1969; Swezey, Pearlstein, and Ton, 1974)



Writing Criterion-Referenced Tests 

The areas of writing CRM's and evaluating criterion-
referenced tests are in the developmental stage. Many people
have written articles hypothesizing how to write a criterion-
referenced test and evaluate it in terms of validity,
reliability, and other test parameters. However, there is a need
for developing a CRM test theory. (Boehm, 1973; Glaser and
Nitko, 1971; Hively, 1974; Jackson, 1970) In a 1974 poll
of its members, the National Council on Measurement in
Education found that the development of a test theory for CRM
was ranked number three in its priority list for research in
measurement. The following two sub-sections discuss various
writings in the field.

An important concept to be cognizant of when writing
a CRM is that of a criterion. Although most writers do not
emphasize the theoretical basis for criteria, Goldstein (1974)
has pointed out that criterion relevancy, deficiency, and
contamination are important concepts to be aware of. Nagle
(1953) stated that a criterion is more relevant when the
criterion measure is closer to the true criterion. Thorndike
(1949) emphasized that the criteria are more relevant if the
behaviors learned in the training program are the same as
those required for success at the ultimate task. (Goldstein,
1974)

Since Travers (1975) has covered behavioral objectives,
it is sufficient to say here that after the organizational
needs assessment and task analysis, behavioral objectives should
be written. Most CRM people have used Mager's (1962) format.
These objectives must be translated into specific test tasks
which, if successfully completed, form the basis for inferring
that the behavior has been acquired by the trainee.

Recently, much work has been done in the analysis
and classification of behavior in training and education, and
this has been helpful in analyzing performance into component
tasks. (Bruner, 1964; Gagne, 1965; Glaser, 1962; Melton,
1964; Miller, 1965) Other studies (Gane and Woolfenden,
1968; Gibson, 1965; Hively, 1966; Newell and Farehand, 1968)
have dealt with examining the specific components and the
sequence of performance of a complex behavior so that the
task domain can be identified for training and testing
purposes.

Specifying the domain of tasks requires a systematic
procedure. Hively (1968) has developed one method to delimit
and clearly define the domain of tasks through the use of an
"item form." Table 2 contains examples of item forms for
subtraction tasks in arithmetic. A title in the left column
names a task of the subtraction domain. Next, a sample
problem is shown as it would appear on the test. The last
two columns contain the general form and generation rules
which define the tasks. A collection of item forms constitutes
a domain from which test items may be drawn. Using item forms,
it is easy to make judgements about the content validity of a
criterion-referenced test or, in fact, any kind of test.

Osburn (1968), who has developed a similar item form,
discussed two conditions that are prerequisites for allowing
inferences to be made about a domain of skills and knowledge
from performance on a sample of items:

The first is that all items that could possibly
appear in the test should be specified in advance.
Secondly, the items in a particular test should be
selected by random sampling or stratified random
sampling from the universe of content. (p. 96)

[Table 2: the item-form rows (task title, sample item, general
form, and generation rules) are not legible in this copy. The
notation key to the table follows.]

Explanation of notation

Capital letters A, B, ... represent numerals.

Small letters (with or without subscripts) a, b, a1, b2, etc.
represent digits.

x ∈ {...}         Choose at random a replacement for x from the
                  given set.

a, b, c ∈ {...}   All of a, b, c are chosen from the given set
                  with replacement.

N_A               Number of digits in numeral A.

N                 Number of digits in each numeral in the problem.

a1, a2, ... ∈ {...}   Generate all the elements necessary. In
                  general, "..." means continue the pattern
                  established.

(a < b) ∈ {...}   Choose two numbers at random without
                  replacement; let a be the smaller.

{H, V}            Choose a horizontal or vertical format.

P{A, B, ...}      Choose a permutation of the elements in the set.
                  (If the set consists of subscripts, permute
                  those subscripted elements.)

Set operations are used as normally defined. Ordered pairs are
also used as usual.

Check             If a check is not fulfilled, regenerate all
                  elements involved in the check statement (and
                  any elements dependent upon them).

Special sets

U  = {1, 2, ..., 9}
U0 = {0, 1, ..., 9}

Table 2. Examples of item forms from the subtraction universe
developed by Hively (Hively, 1968)

Jackson (1970) stated that ". . . the difficulty of
objectively defining a test-construction process is directly
related to the complexity of the behavior the test is designed
to assess." (p. 7) Thus, the first of Osburn's conditions
would be difficult to satisfy for complex domains. However,
the difficulty of listing the elements of a universe of item
content can be overcome, to a certain extent, if a generative
process can be defined which could, in theory, produce such a
listing. Through the use of the item form, it is possible to
produce such a generative process. (Hively, 1968; Osburn, 1968)

Osburn (1968) has described the characteristics of
an item form as follows:

. . . (1) it generates items with a fixed syntactical
structure; (2) it contains one or more variable structures;
and (3) it defines a class of item sentences by specifying
the replacement sets for the variable elements. (p. 96)

Using the item form method, there is an "unbroken
link" between the generative system and the specific item
produced. A collection of item forms, together with the
replacement sets for the variable elements, then defines a
universe of content. In addition to the numerical type of
Hively's, Osburn has developed verbal replacement sets and
a hierarchical arrangement of test tasks to be generated.
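
The idea that item forms plus replacement sets define a universe
of content, from which a test is then drawn by stratified random
sampling, can be sketched in a few lines of Python. This is only
an illustration: the two item forms, their names, their
replacement sets, and their check rules are hypothetical and are
not taken from Hively's or Osburn's tables.

```python
import itertools
import random

# Hypothetical item forms (not from the source). Each has a fixed
# frame, replacement sets for its variable elements, and a check.
ITEM_FORMS = {
    "basic_fact": {
        "frame": "{a} - {b} = ?",
        "sets": {"a": range(10, 19), "b": range(1, 10)},
        "check": lambda v: 0 <= v["a"] - v["b"] <= 9,
    },
    "no_borrow": {
        "frame": "{a} - {b} = ?",
        "sets": {"a": range(20, 100), "b": range(1, 10)},
        "check": lambda v: v["a"] % 10 >= v["b"],
    },
}


def universe(form):
    """List every item the form can generate (all items specified in advance)."""
    names = list(form["sets"])
    items = []
    for combo in itertools.product(*(form["sets"][n] for n in names)):
        values = dict(zip(names, combo))
        if form["check"](values):
            items.append(form["frame"].format(**values))
    return items


def stratified_sample(forms, per_form, seed=None):
    """Draw per_form items at random, without replacement, from each form."""
    rng = random.Random(seed)
    return {name: rng.sample(universe(form), per_form)
            for name, form in forms.items()}


test_items = stratified_sample(ITEM_FORMS, per_form=3, seed=1)
```

Because each item form is treated as one stratum, a different
seed yields a different test drawn from the same universe under
the same rules.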

An item form could consist of a sentence with one or
more blanks, and the words or numbers that fit into the blanks
could be systematically varied to produce items of different
levels of specificity. Since this procedure is systematic
and rule bound, it has been adaptable to computer programming.
(Ferguson, 1969) Shoemaker and Osburn (1969) have constructed
a computer program ". . . capable of generating random or
stratified random parallel tests from a specified content
population." (p. 165)

An example of a sentence frame for the input of a
computer program would be:

Given a normal distribution with a mean equal to ____
and a standard deviation equal to ____. If one number
is randomly sampled from this distribution, what is the
probability that this number will be greater than or
equal to ____? (Shoemaker and Osburn, 1968)

The blanks in the form are filled in by a random number
generator, which can be controlled to supply realistic problems
and reduce difficult and long computations. (Shoemaker and
Osburn, 1969)
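
A minimal sketch of how such a frame might be filled by a
controlled random number generator is given below. It is not a
reconstruction of Shoemaker and Osburn's program: the value
ranges, the function names, and the answer key are assumptions
chosen so that every cutoff is a whole number lying a multiple of
half a standard deviation from the mean.

```python
import math
import random

FRAME = ("Given a normal distribution with a mean equal to {mean} "
         "and a standard deviation equal to {sd}. If one number is "
         "randomly sampled from this distribution, what is the "
         "probability that this number will be greater than or "
         "equal to {cutoff}?")

# Assumed replacement sets, chosen to keep the arithmetic short.
MEANS = list(range(50, 101, 10))
SDS = [2, 4, 6, 10]
Z_VALUES = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]


def upper_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def generate_item(rng):
    """Fill the three blanks and return (item text, keyed answer)."""
    mean = rng.choice(MEANS)
    sd = rng.choice(SDS)
    z = rng.choice(Z_VALUES)
    cutoff = mean + z * sd
    text = FRAME.format(mean=mean, sd=sd, cutoff=f"{cutoff:g}")
    return text, upper_tail(z)


def parallel_test(n_items, seed):
    """Generate one randomly parallel form; the seed identifies the form."""
    rng = random.Random(seed)
    return [generate_item(rng) for _ in range(n_items)]


form_a = parallel_test(5, seed=1)
form_b = parallel_test(5, seed=2)
```

Two forms built this way differ in their numbers but share the
same frame and replacement sets, which is the sense in which they
are parallel.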

Bormuth (1970) has argued that tests that use
NRM procedures cannot unequivocally claim to represent the
properties of instruction, nor can they be objectively
reproduced. A norm-referenced test item, Bormuth wrote, is
a property of the test writer and not a property of
instruction. A score on a norm-referenced test reflects the
learner's responses to the writer's responses to instruction
or, in other words, to the constructor's behavior.

Ebel (1962) had earlier expressed a similar view:

Specialists in educational measurement generally
recognize that most objective tests rest on highly
subjective foundations. The abilities, values, and
idiosyncrasies of the test constructor have played a
major part in determining the content of most tests.
Test specifications sometimes exist only in the mind of
the test constructor or in a few brief written guidelines.
When written, they often have more to say about the form
of the test than about its content. (p. 22)

Bormuth (1970) has suggested that a linguistic
analysis be used to explicitly translate instructional
objectives into test items. Like the item form, this would
introduce more objectivity and replicability into test
writing.

As Swezey, Pearlstein, and Ton (1974) have shown,
there are many studies going on with CRM in different
areas. One of the more extensive studies on criterion-
referenced testing was done by Thornton and Wasdyke (1972).
These test specialists have developed "The Taxonomy of
Behavior for Career Development and Measurement," which
provides a framework for the logical tracing of observed
behaviors from the processes of job analysis, through test
development and performance evaluation, to validation. The
taxonomy can be used to write comprehensive test
specifications for simple to complex ranges of behaviors.

There are five steps in Thornton and Wasdyke's (1972)
method:

1. Job (task) analysis and specification (in task
analysis statements).

2. Translation and classification of task analysis
statements into behavioral objectives.

3. Definition of the job performance standards in
behavioral terms.

4. Multi-dimensional test specification and
development.

5. Measurement of performance -- validity (translation
of occupational test items into behavioral objectives).
(p. 3)



The above procedure was used for an examination
constructed by the Educational Testing Service for police
promotion procedures in New York City.

The first step in the above process results in an
ordered collection of task statements which describe the
duties and responsibilities of a job. The second step
translates task statements into behavioral objectives, indicating
the condition, performance, and extent. (This is very
similar to Mager's procedure for writing behavioral objectives.)
This results in a list of behavioral objectives required for
acceptable job performance. Each objective is then described
in terms of the cognitive activity, the affective mode
necessary, and the psychomotor skills required for
satisfactory job performance. After this, each objective is
classified within a three-dimensional, 90-cell taxonomy of
behavior, and this serves as a blueprint of the terminal
objectives of the process: selection (prediction), training
(education and career development), and evaluation
(performance). (Thornton and Wasdyke, 1972)

In defining job performance standards, judges are
used to determine precisely what the minimal acceptable
job performance is in terms of that behavior. The results of
this process are twofold: a statement of minimum acceptable
behavior for developing a test of minimum competency, and the
precise lower limit of acceptable job performance specified
in a behavioral scale. (Thornton and Wasdyke, 1972)

The final behavioral objectives can be used to write
multi-dimensional test specifications. The specifications
include the behavior to be measured in the test and the
precise level or levels within the taxonomy which most
appropriately measure the required job behavior. (Thornton and
Wasdyke, 1972)

The last step of measuring performance and validation
is logically determined in two ways:

1. The translation of the test items into behavioral
objectives, their classification by means of the taxonomy,
and their comparison, objective by objective, with the
original task derived taxonomy.

These operations are performed by researchers other
than the job analysts and test designers.

2. The comparison of candidate's performance, behavior
by behavior, on the test and as rated by supervisors on the
job. (Thornton and Wasdyke, 1972, p. 12)

An example of translating a test item into a
behavioral objective would be:

Condition: Given witnesses to a crime in a physical
situation in which they cannot be separated from each
other.

Performance: Predict the effect of this situation on
the information gathered from these witnesses in two areas,
the sequence of the events and the description of the
perpetrator.

Extent: Accuracy of prediction of 100% based on
correct answer in each case. (Thornton and Wasdyke, 1972,
p. 12)

This item objective would be traced through the 
taxonomy back to the original objective and accepted perform- 
ance standard. 

Thornton and Wasdyke (1972) have stated that this
logical validation is not a substitute for statistical
validity, but a supplement to traditional methods.

These methods are some of the recent developments in
criterion-referenced test construction. The major goal of all
of these methods is to allow inference from test
performance to behavioral referents. All items are specified
by rules and there is the advantage of being able to randomly
sample items from a specified universe of content. Work is
being carried on by several universities and test services
to refine these methods. (CTB/McGraw-Hill; Educational
Testing Service; Army Research Institute)

Evaluation of Criterion-Referenced Tests

After defining the universe of content and
constructing the item forms, the final form of the test must be
constructed. Item selection and analysis have been well
developed for NRM but not for CRM. While NRM's depend on
variance in the test scores, CRM's may display very little
variance. (Popham and Husek, 1969)

For example, if a training program for sewing
machine operators seeks to reach a certain level of
competence, a pretest-posttest experimental design could be
used. Scores on the posttest should show an increased mean
performance and a decreased variance since all trainees are
expected to acquire knowledge and skill mastery of sewing
concepts. (Popham and Husek, 1969)

It should be noted at this point, however, that using
CRM's does not limit achievement or competency beyond a certain
performance level. As Glaser and Nitko (1971) have stated:

In theory, adaptive instruction seeks to ensure that
all individuals in the population show certain levels of
mastery in the instructional domain, while not excluding
differences in achievement beyond the general level of
mastery established. (p. 659)

Concerning the evaluation of CRM's, measurement
specialists cast doubt on applying the conventional empirical
evaluation procedures of mental test theory for assessing
reliability and validity and for analyzing test items. With
NRM's, the more variability the better, since the purpose of
the test is to spread individuals out. However, with CRM's,
variability is irrelevant. The meaning of the score flows
directly from the connection between the items and the
criterion. (Cox, 1971)

The subtle implication of this central difference is
that all traditional theories and formulas for determining
what a "good" test is can no longer be used with criterion
measures. Most of the formulas for test adequacy indices
rely on the concept of variability. (Popham and Husek, 1969)

Specialists have stressed that a criterion-referenced
test may be a good test even if there is no variance in the
population's scores. Indeed, with some criterion tests, it
may be that all students will pass every item. (Cartier,
1968)

Validity. Tuckman (1972) has defined validity of a
test as ". . . the extent to which a test measures what it
purports to measure." (p. 139) For example, a test on
repairing automobile ignitions must be a true indication of a
student's skill and knowledge of automobile ignitions, and not
of mathematics or reading.

Validity, which is essential for any good test, has
been defined in many ways throughout the years. Gulliksen
(1950) has stated, "The validity of a test is the correlation
of the test with some criterion." (p. 68) Cureton (1951)
wrote, "The validity of a test is an estimate of the
correlation between the raw test scores and the 'true' criterion
scores." (p. 625) Lindquist (1942) has defined validity
as ". . . the accuracy with which it measures that which it
is intended to measure, or as the degree to which it
approaches infallibility in measuring what it purports to measure."
(p. 213) Edgerton (1949) has stated, "By 'validity' we refer
to the extent to which the measuring device is useful for a
given purpose." (p. 52) Cronbach (1960) has advocated, "The
more fully and confidently a test can be interpreted, the
greater its validity." (p. 1151)

There is a conceptual similarity among these
statements, but there are also some distinctive differences. The
first two deal with correlations, the third avoids statistics,
the fourth stresses utility, and the fifth relates to
interpretability of the test scores.

The American Psychological Association has identified
three basic types of test validity. Content validity is the
extent to which a test measures a representative sample of
the subject matter content and the behavioral change under
consideration. Criterion-related validity is the extent to
which test performance is related to some other valid measure.
Construct validity is the extent to which test performance can
be interpreted in terms of certain psychological constructs.
(Gronlund, 1971, pp. 78-90)

The last two procedures for assessing validity are
based on correlation and thus variability. Hence, they would
not be too accurate for CRM's. Content validity, by its
very nature, is the most suitable to validate a criterion-
referenced test. (Swezey, Pearlstein, and Ton, 1974)

Content validity is best evidenced by comparing the
test content to the universe of content and behaviors being
measured. Mehrens and Lehmann (1969) stated that this is
accomplished by:

. . . a comparison of the test content with courses
of study, instructional materials and statements of
instructional goals, and by critical analysis of the
processes required in responding to the items. (p. 310)

Test experts have used different methods to verify
content validity. Popham and Husek (1969) have suggested
that the general procedure for validating a CRM would be
"judgement . . . based on the test's apparent relevance to
the behaviors legitimately inferable from those delimited by
the criterion." (p. 6) Osburn (1968) stressed that a CRM
must have content validity built into it because:

What the test is measuring is operationally defined
by the universe of content as embodied in the item
generating rules. No recourse to response-inferred
concepts such as construct validity, predictive validity,
underlying factor structure or latent variable is
necessary to answer this vital question. (p. 97)

Content validity can also be determined by using
Hively's (1968) item form, which consists of a complete set
of rules for generating a domain of test items for a specific
objective. Independent experts or judges are used to decide
whether or not a test item is congruent with the highly
specific behavior domain explicated by the item form.

Thornton and Wasdyke's (1972) method of validation,
by rewriting a test item into a behavioral objective and
tracing it back through a taxonomy to the original objective,
is another way to check validity. Fremer (1972), who has
suggested a variety of methods using a panel of judges to
validate a test, summed up the feeling of most measurement
people by stating, "More than one method should be used to
validate any desired criterion-referenced inference." (p.
28)

Reliability. As with validity, there are many ways to
describe the reliability of a test. One general definition
of reliability is ". . . the extent to which a test is
consistent in measuring whatever it does measure." (Mehrens and
Lehmann, 1969, p. 368)

Since most of the methods for estimating reliability
are dependent upon variance, they cannot be used for CRM's
with complete confidence. For example, one of the most common
ways to determine internal consistency is by using the Kuder-
Richardson formula, which relies on score variance. (Tuckman,
1972) However, if everyone on a CRM obtains a perfect score,
the internal consistency estimate would be zero, which
indicates poor reliability. CRM advocates state that such a test
should not be assumed to be poor. In fact, it is possible for
a CRM to have a poor internal consistency index and still be
a good measure. (Husek and Sirotnik, 1968)
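
The dependence on variance can be seen by computing the index
directly. The sketch below assumes the Kuder-Richardson formula
meant here is formula 20, KR-20 = (k/(k-1)) (1 - sum(pq)/s^2),
with s^2 the population variance of total scores; the function
name and the three data sets are invented for illustration. Note
that when every examinee earns the same score the formula is
strictly a division of zero by zero, so the function returns None
in that case rather than a number.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for 0/1 item responses.

    responses: one list per examinee, one 0/1 entry per item.
    Returns None when the total scores have no variance, because
    the formula then divides zero by zero.
    """
    n = len(responses)
    k = len(responses[0])
    totals = [sum(person) for person in responses]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n
    if variance == 0:
        return None
    sum_pq = 0.0
    for i in range(k):
        p = sum(person[i] for person in responses) / n  # item pass rate
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / variance)


# Hypothetical data: a spread-out group, a group in which everyone
# has mastered everything, and a group with a single missed item.
spread = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
all_mastery = [[1, 1, 1, 1]] * 5
near_mastery = [[1, 1, 1, 1]] * 4 + [[1, 1, 1, 0]]

kr20_spread = kr20(spread)              # high: scores vary widely
kr20_all = kr20(all_mastery)            # None: no variance at all
kr20_near = kr20(near_mastery)          # about zero despite near-total mastery
```

The near-mastery group is the case of practical interest: the
test may be doing exactly what it was built to do, yet the index
reports essentially no internal consistency.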

Other typical indices, such as the split-halves
method, are also not appropriate as internal consistency
estimators.

Concerning external consistency estimates, these
are also cloudy when used with CRM's. Reliability can be
measured by giving the same people the same test on more
than one occasion and then comparing each person's
performance on both testings. (Tuckman, 1972) However, this
test-retest correlation coefficient, dependent on
variability, cannot be relied upon either. Popham and Husek (1969)
have said that high inter-item and test-retest
correlations are fine and that these indices can be used to support
the consistency of the test. However, a criterion measure
could be highly consistent and yet indices dependent on
variability might not reflect that consistency.

Jackson (1970) has proposed a comparison of the scores
on two forms of a CRM measuring the same material, since
criterion-referenced tests should be able to be generated
independently and objectively. An index of agreement between
the two forms could then be used.

Cox and Graham (1971) have illustrated another way
reliability might be viewed, using a sequentially scaled test.
Theoretically, the test is constructed so that the student
answers all items up to his level of attainment and misses
all items beyond this certain point. The test uses a Guttman
scale, the total score indicating the individual's response
pattern. A coefficient of reproducibility is found that
indicates how well an individual's response pattern could be
reproduced from knowledge of his total score. This
coefficient might be used as a type of reliability estimate across
all individuals taking the test.
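
One common way of computing a coefficient of reproducibility is
sketched below. The error-counting rule (compare each response
pattern with the ideal pattern implied by the total score) is an
assumption, since the passage does not say which rule Cox and
Graham used, and the data are invented.

```python
def reproducibility(responses):
    """Coefficient of reproducibility for 0/1 responses.

    Items are ordered from easiest to hardest by pass count. A
    person with total score s is predicted to pass exactly the s
    easiest items; each response that differs from the prediction
    counts as one error (an assumed counting rule).
    """
    n = len(responses)
    k = len(responses[0])
    pass_counts = [sum(person[i] for person in responses) for i in range(k)]
    order = sorted(range(k), key=lambda i: -pass_counts[i])
    errors = 0
    for person in responses:
        score = sum(person)
        for rank, item in enumerate(order):
            predicted = 1 if rank < score else 0
            errors += predicted != person[item]
    return 1 - errors / (n * k)


# Hypothetical data: a perfectly sequential scale, then the same
# scale with one examinee who passes only the hardest item.
perfect = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
with_deviant = perfect + [[0, 0, 1]]

cr_perfect = reproducibility(perfect)
cr_deviant = reproducibility(with_deviant)
```

A value of 1.0 means every response pattern can be recovered
from the total score alone; each out-of-sequence response lowers
the coefficient.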

Livingston (1972a) has proposed a controversial
classical test theory approach to CRM, whereby the
psychometric theory of true and error scores is used to find
the reliability. Livingston has stated that when using
CRM, one wants to find out how far a score deviates from a
fixed standard. Thus, he has suggested using deviations from
a criterion score instead of a mean score (as in NRM), and
defines CRM reliability as a ratio of mean squared deviations
from the criterion score (those of true scores to those of
observed scores).
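
Livingston's coefficient is usually written as
K^2 = (r s^2 + (M - C)^2) / (s^2 + (M - C)^2), where r is a
conventional reliability estimate, s^2 and M are the variance
and mean of the observed scores, and C is the criterion score.
That form is supplied here from the general literature rather
than from the passage above, so the sketch below should be read
under that assumption; the scores and the value of r are
invented.

```python
def livingston_k2(scores, reliability, criterion):
    """Criterion-referenced reliability as a ratio of mean squared
    deviations from the criterion score (assumed form, see lead-in).

    Undefined (division by zero) only if the scores have no variance
    and their mean equals the criterion.
    """
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((x - mean) ** 2 for x in scores) / n
    offset = (mean - criterion) ** 2
    return (reliability * variance + offset) / (variance + offset)


# Hypothetical posttest scores on a 10-item test, with an assumed
# conventional reliability of 0.5.
scores = [8, 9, 10, 9, 9]
k2_far = livingston_k2(scores, reliability=0.5, criterion=7)   # mean far above C
k2_at = livingston_k2(scores, reliability=0.5, criterion=9)    # mean equal to C
```

When the group mean equals the criterion, the coefficient
reduces to the conventional reliability; the farther the mean
lies from the criterion, the closer the coefficient comes to 1.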

Oakland (1972), Harris (1972), Meredith and Sabers
(1972), and others have taken issue with Livingston's model.
(For a discussion, see Swezey, Pearlstein, and Ton, 1974)

Swaminathan, Hambleton, and Algina (1974) have
proposed that criterion-referenced reliability be defined as:

. . . a measure of agreement over and above that which
can be expected by chance between the decisions made about
examinee mastery states in repeated test administrations
for each objective measured by the criterion-referenced
test. (p. 263)

These specialists believe that the primary purpose of CRM
is to classify individuals into mastery categories on the
objectives covered by the test. They emphasized that using
their method will result in as many reliabilities as there
are objectives covered by the test.
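
Both a simple index of agreement between two forms and a
chance-corrected index of agreement between mastery decisions
can be illustrated together. The sketch assumes that Cohen's
kappa is an acceptable statistic for "agreement over and above
that which can be expected by chance"; the passage does not name
the statistic, and the mastery decisions below are invented.

```python
def mastery_agreement(first, second):
    """Return (proportion of agreement, kappa) for two sets of
    mastery decisions on the same examinees for one objective.

    first, second: lists of booleans, True meaning "master".
    Kappa is None when chance agreement is already 1.
    """
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    p1 = sum(first) / n
    p2 = sum(second) / n
    chance = p1 * p2 + (1 - p1) * (1 - p2)
    if chance == 1:
        return observed, None
    return observed, (observed - chance) / (1 - chance)


# Hypothetical decisions for ten examinees on two administrations.
first = [True, True, True, False, False, True, True, False, True, True]
second = [True, True, False, False, False, True, True, True, True, True]

agreement, kappa = mastery_agreement(first, second)
```

Running the function once per objective gives one coefficient
per objective, which matches the authors' remark that there will
be as many reliabilities as objectives.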

The area of reliability needs much more research and
discussion before an index is accepted in the field.

Item analysis. In test item construction for an NRM,
a test writer wants variability, so questions that are too
hard or too easy are discarded. The CRM test writer is mainly
concerned with making sure that the test items accurately
sample the range of criterion behavior being measured. The
items must possess congruency with the class of eligible
behaviors as prescribed by an instructional objective. The items
can be difficult or easy, discriminating or indiscriminating,
but must reflect the domain of relevant tasks. (Popham, 1971)

After the items have been formed into a test and
results received from administering it, there is the procedure
of analyzing and improving it. With NRM, item analysis
techniques have been used to identify those items that were
not properly discriminating among individuals. (The
discriminating power of a test item is its ability to differentiate
between persons possessing much of the criterion trait
and those possessing little of the trait.) Nondiscriminating
items, or those not separating the more knowledgeable from
the less knowledgeable, are usually those that are too hard,
too easy, and/or ambiguous. (Ahmann, 1962)

Osburn (1968) has made the following observation about
traditional item analysis techniques as applied to CRM:

It is evident that these procedures may bias the
inference regarding a person's true score on the universe
of content, and the nature of the bias will generally be
unknown. . . . Rejection of the item always implies
rejection of the class of items to which the item belongs
or at least a modification of the generating rule that
specifies the item class. (p. 99)

Jackson (1970) remarked that it is difficult to see
how item selection could legitimately be influenced by item
analysis data because the comparability of test scores and
behavioral standards is postulated upon a systematic sampling
of tasks from a universe of content.

However, other people say that there is some value
in item analysis techniques. Popham and Husek (1969) have
suggested that a nondiscriminating item should remain on the
test if the item reflects an important attribute of the
criterion.

Gronlund (1965) reaffirmed this by writing:

. . . a low index of discriminating power should
alert us to the possible presence of technical defects
in a test item but it should not cause us to discard an
otherwise worthwhile item. A well-constructed
achievement test will, of necessity, contain items with low
discriminating power and to discard them would result in
a test which is less, rather than more, valid. (p. 214)

Popham and Husek (1969) proposed that a positively
discriminating item is desirable on a CRM and
naturally should be kept. However, a negatively
discriminating item, one which is answered correctly more often by the
less knowledgeable than by the more knowledgeable, should be
treated in the same way on both types of measures. It should
be revised or thrown out.

Cox and Vargas (1966) investigated two indexes of a
CRM item's ability to discriminate between pre- and
post-instruction performance. One index was computed by
subtracting the percentage of individuals who passed the item on the
pretest from the percentage who passed it on the posttest.
The other method was the common upper group minus lower group
discrimination index. The researchers concluded that the
traditional way (upper minus lower group) could not be used
but that the pretest-posttest method should warrant
consideration when using CRM.
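The two indexes can be illustrated with a short sketch in Python. The function names, the invented response data, and the 27 percent upper and lower group fraction are assumptions made for illustration only; the source does not give the grouping fraction or any data from Cox and Vargas.

```python
def percent_passing(responses):
    """Percentage of 1s in a list of 0/1 responses to a single item."""
    return 100.0 * sum(responses) / len(responses)


def pre_post_index(pre_responses, post_responses):
    """Percent passing on the posttest minus percent passing on the pretest."""
    return percent_passing(post_responses) - percent_passing(pre_responses)


def upper_lower_index(item_responses, total_scores, fraction=0.27):
    """Upper-group minus lower-group percent passing on one administration.

    The 27 percent grouping fraction is an assumed convention,
    not a value taken from the source.
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))
    ranked = sorted(range(n), key=lambda i: total_scores[i])  # low to high
    lower, upper = ranked[:k], ranked[-k:]
    return percent_passing([item_responses[i] for i in upper]) - percent_passing(
        [item_responses[i] for i in lower]
    )


# Invented data: ten trainees, one item, before and after instruction.
pre = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
post = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
post_totals = [18, 19, 20, 20, 19, 18, 20, 19, 20, 17]

print(pre_post_index(pre, post))
print(upper_lower_index(post, post_totals))
```

In this invented case the posttest total scores cluster near the maximum because most trainees have reached criterion, so the upper and lower groups differ little. That restricted range is one reason a within-group index may be less informative for CRM than the pretest-posttest difference.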

Jackson (1970) stated that two groups of people 
could be used with Cox and Vargas's procedure as long as 
one was known to have mastered the behavior domain to a 
greater degree than the other. It would be necessary to 
revise the domain under which a test was developed if certain 
items were non-discriminating between groups. This type of 
analysis could also be used as an empirical check on the 
validity of the hypothetical constructs that the test intend- 
ed to measure. 

A concept related to reliability is the length of the
test. If CRM is used to evaluate a program or treatment, the
same tests (or an equivalent form) need not be used. Cronbach
(1963) and Husek and Sirotnik (1968) have shown that the
concept of item sampling, in which different people complete
different items, is highly appropriate in evaluating the
adequacy of treatments. Thus, more behavior could be sampled
with shorter tests by constructing different
forms to be administered to individuals in the treatment group.

In summary, traditional item analysis methods can be
used with CRMs to a certain extent, but it must be remembered
that discrimination is only a warning flag. Even if an item
is negatively discriminating, the cause may be an
instructional deficiency or the presence of ambiguity, clues, and other
technical defects in the item. (Gronlund, 1965) More
developmental work on item analysis procedures, especially when only
one test administration is possible, is needed.

Reporting and Interpretation. Flanagan (1951) has
said that ". . . test scores are meaningful and valuable to
the extent that they can be interpreted in terms of capacities,
abilities, and accomplishments of educational significance."
(p. 695) Ebel (1962) has pointed out that something important
tends to get lost when raw scores are transformed into standard
scores. "What gets lost is a meaningful relation between the
scores on the test and the character of the performance it is
supposed to measure." (p. 17) Ebel has advocated the use of
"content standard" test scores by building meaning into the
test, and hence into the test score, by a systematic,
explicitly specified process of test construction.

Both NRM and CRM aid in making decisions about
individuals and training treatments. The methods of norm-referenced
reporting are through group-relative descriptions such as
percentile ranking and standard scores. Thus, by a single
score it is possible to tell how well an individual performed
in relation to the group. (Seashore, 1955)

When interpreting an individual's performance on a
CRM, group-relative indices are not appropriate. Criterion
tests yield scores which are essentially "on-off" in nature.
That is, the student has either mastered the criterion or has
not. (Popham and Husek, 1969) In practice, however, a range
of acceptable performance exists so several on-off scores
should be established. (Garvin, 1971)

If an instructional objective of a carpentry training
program was to be able to identify different types of hand
tools used in carpentry, a 20-item objective test could be
constructed and a required proficiency level set. The
experts may set the minimum proficiency level at 90 percent,
thus allowing error on 2 of the 20 items. In reporting an
individual's performance, one alternative is that the person
has reached the minimum cut-off score (90 percent) or has not.
If the level is not met, the individual could not move on to
the next topic, and remedial instruction would be needed.
(Popham and Husek, 1969)

Whether to report the degree of less-than-criterion-level
performance depends on the use of the test scores. If, for example, there
are two kinds of remedial programs available, one for those
close to criterion and one for those far from criterion, the
degree of performance would be appropriate to report. (Popham
and Husek, 1969)

Using CRM, the number of individuals who achieved
the pre-established criterion level could be reported.
Although this seems to be little data, it tells exactly the
proportion of learners who did not achieve the criterion level.
The traditional descriptive statistics, such as means and
standard deviations, could still be used since it is necessary
to know the average performance produced by the treatment
in addition to its variance. (Popham and Husek, 1969)
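The reporting options described above can be sketched in Python using the 20-item carpentry test and its 90 percent cut-off. The 75 percent boundary between the two remedial programs, the function names, and the sample scores are hypothetical; the source gives only the cut-off and the test length.

```python
from statistics import mean, pstdev

N_ITEMS = 20
CUTOFF_PCT = 90  # minimum proficiency level from the carpentry example
NEAR_PCT = 75    # hypothetical boundary between the two remedial programs


def report_individual(n_correct):
    """On-off report, with a degree-of-performance band below criterion."""
    pct = 100 * n_correct / N_ITEMS
    if pct >= CUTOFF_PCT:
        return "criterion reached"
    if pct >= NEAR_PCT:
        return "remedial program for those close to criterion"
    return "remedial program for those far from criterion"


def report_group(scores):
    """Number and proportion reaching criterion, plus mean and standard deviation."""
    reached = sum(1 for s in scores if 100 * s / N_ITEMS >= CUTOFF_PCT)
    return {
        "n": len(scores),
        "reached": reached,
        "proportion_reached": reached / len(scores),
        "mean": mean(scores),
        "sd": pstdev(scores),
    }


# Invented number-correct scores for four trainees.
print(report_individual(18))
print(report_group([20, 18, 17, 10]))
```

The individual report stays criterion-referenced: it compares each score with the cut-off, never with the other trainees. The group summary adds the mean and standard deviation only to describe the treatment as a whole.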

Millman (1972) advocated a new grading system based on
CRM:

When criterion-referenced measurement is used to
guide and monitor the instructional program, it is a
logical next step to have the learner's grades consist
of check marks opposite instructional objectives which
indicate which skills and understandings have been
acquired. (p. 280)

The examples of Job Corps Training Achievement
Records (1974), found in Appendix A, are similar to what
Millman has called for.

For a more complete and theoretical discussion of
developing and analyzing CRMs, see Swezey, Pearlstein,
and Ton (1974) and Swezey and Pearlstein (1974).

The Application of Criterion-Referenced Measurement

There have been many related areas and spin-offs from
the CRM movement, including mastery learning (Block, 1971),
domain-referenced testing (Hively, 1968, 1974),
objective-referenced testing (Baker, 1972), performance testing
(Osborn, 1974), and competency-based education (Burns
and Klingstedt, 1972).

Prager et al. (1972) designed a CRM program called
Individual Achievement Monitoring System (IAMS) for use with
handicapped people; Popham (1973) used CRM in teacher
performance testing. In the Experimental Volunteer Army
Training Program, Taylor, Michael, and Brennan (1973)
used performance tests for different military occupations.

An instructional innovation that has incorporated 
the use of CRM has been the systems approach to curriculum 
design. The following section illustrates how Butler (1972) 
has proposed CRM should be utilized in designing vocational 
and technical training programs. 



Butler's System

Butler (1972), a vocational educator and currently
director of curriculum research and development at the
New England Resource Center for Occupational Education,
has developed the training systems model shown in Figure 1.

Conduct feasibility study. The first step in
Butler's system is an analysis of trends with regard to
job markets and occupational patterns; trends in economic,
business, agricultural, and industrial expansion; types
of jobs and worker competencies needed; availability of
training programs and facilities, and their costs; and
other related information.

Conduct task analysis. After the decision has been
made that a specific training program or course is needed,
a job/task analysis is conducted. The job/task analysis
is the foundation upon which the training objectives,
content, sequence, methods, media, and evaluation are based.
The job/task analysis is a summary of the behavioral content
of a job broken down into duties, tasks, activities, and
actions. Each task, which is "a logical and necessary step
in the performance of a duty" (p. 74), should be described
in the following terms:

• The cues, signals, and indications that call for the
action or reaction.

• The control, object, or tool to be used or manipulated.

• The action or manipulation to be made.

• The cues, signals, and indications (feedback) that
the action taken is, or is not, correct and adequate.
(p. 75)

Working conditions, tools and equipment, and stand- 
ards of performance are necessary for each task. 

There are many possible sources of information to
consult in writing a job/task analysis, such as training
literature, manuals, textbooks, the Dictionary of Occupational
Titles, professional associations, trade unions, and
governmental agencies. However, the most reliable and
valid source is the incumbent worker. Morsch (1964)
discussed seven methods of job analysis which could be used:
the questionnaire-survey, individual interview, observation
interview, group interview, daily diary method, work
participation method, and critical incident technique.

Butler (1972) stressed that "more . . . emphasis
should be placed on observation and interview of the
apprentice or entry-level worker to find out what he actually does
on the job . . ." (p. 78)

Develop training objectives. Based on the task
analysis, the designer must derive explicit statements about
what a student, upon completion of the training program, will
be able to do. Training objectives must be described in
observable and measurable terms. Butler uses Mager's (1962)
formula for writing objectives, whereby the conditions and
limitations, overt behavior displayed by the student, and
performance standards must be specified. Both terminal (unit,
course, program) objectives and interim or enabling (lesson,
activity, module) objectives must be specified. These may
be directly coupled to broad goal statements and possibly
even broader educational or philosophical constructs.

Develop criterion tests. Criterion tests are used
in the early stages of design to determine validity of the
objectives, and later to provide feedback and help
perform summative evaluations of the entire course or
training program.

Validate the criterion tests. In order to validate
the criterion test it is administered to an
untrained-unskilled group and to a trained-skilled group and a
correlation is computed to obtain validity and reliability
coefficients. Test item analysis at this point calls for
interpretations similar to the following: (a) if, for a
given test item, the majority of untrained group responses
are correct, the item has little or no validity or
reliability; and conversely, (b) if, for a given test item, the
majority of trained group responses are incorrect, the item
likewise has little or no validity or reliability.
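The two interpretations (a) and (b) amount to a simple screening rule for each item. The sketch below, in Python, is one way to express it; the function name, the flag wording, and the response data are invented for illustration and are not part of Butler's procedure as published.

```python
def flag_item(untrained_responses, trained_responses):
    """Screen one item using 0/1 responses from each group.

    Returns a list of flags; an empty list means neither rule fired.
    """
    flags = []
    # Rule (a): a majority of the untrained group answered correctly.
    if sum(untrained_responses) > len(untrained_responses) / 2:
        flags.append("majority of untrained group correct")
    # Rule (b): a majority of the trained group answered incorrectly.
    if sum(trained_responses) < len(trained_responses) / 2:
        flags.append("majority of trained group incorrect")
    return flags


# Invented responses to three items (1 = correct, 0 = incorrect).
print(flag_item([0, 0, 1, 0, 0], [1, 1, 1, 0, 1]))  # neither rule fires
print(flag_item([1, 1, 1, 0, 1], [1, 1, 1, 1, 1]))  # rule (a) fires
print(flag_item([0, 0, 0, 0, 1], [0, 0, 1, 0, 0]))  # rule (b) fires
```

A flagged item is a candidate for revision rather than automatic removal, since a flag may reflect a faulty item, a faulty objective, or faulty instruction.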

Validate training objectives. The criterion test should
contain at least one item for each objective, but no more
than five items for each objective; otherwise the test
becomes too long for practical purposes. Validating the
criterion test and validating training objectives can be
accomplished concurrently, provided the test item itself is
not at fault. Interpretations similar to those made in the
preceding step are employed in this step; e.g., (1) if, for
a given test item and its companion objective, the majority
of untrained group responses are correct, there may be no
need to include that objective in the curriculum; and, (2)
if, for a given test item and its companion objective, the
majority of trained group responses are incorrect, there may
be no need to include that objective in the course because,
apparently, the worker can perform on the job without that
knowledge or skill. According to Butler's model, the
initial design phase has been completed at this point, but
the remaining phases also require validation considerations.

Develop learning sequence. The determination of the
learning sequence is done according to the duties, tasks,
and activities provided in the job/task analysis. The
following chart shows a pyramidal form of learning structure
and sequence.



[Figure 2: pyramid diagram. The Job at the apex branches into Duties (1, ...), each duty into Tasks (1.1, 1.2, 1.3, 2.1, ...), and each task into Activities (1.1.1, ...) at the base.]

Figure 2. Pyramidal Form of Learning Structure and Sequence.
(From Butler, 1972, p. ir4)

Activities, tasks, and duties are structured (and learned)
in both a vertical and horizontal sequence. The learning of
one is dependent upon accomplishment of those which precede
it. Most curriculum experts recognize that sequencing must
be approached with a great deal of flexibility. The general
guideline of efficiency should influence sequencing.

Butler set forth a matrix analysis technique for 
preparing the course outline in which supporting knowledges 
and skills for activities, tasks, and duties are listed. 
The learning sequence can be plotted by starting with the
terminal objective and working backward through each
preceding prerequisite, in essence, from the complex back
to the simple. Butler suggested listing all terms, concepts,
rules, and principles which pertain to each objective. Each
number is then placed in a two-dimensional matrix
(discrimination-association) along a diagonal line from top left to
bottom right. Associations then are marked in the common
squares above the diagonal, and discriminations are marked
in the common squares below the diagonal. By shuffling and
reshuffling, a rearranged matrix can be plotted which depicts
an optimum clustering of discriminations and associations
around the diagonal, which results in the best sequencing.
The clusters tend to depict broad concepts in the curriculum. 
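The shuffling step can be pictured as a search for the ordering that keeps marked squares closest to the diagonal. The Python sketch below is a brute-force stand-in for the manual reshuffling. It is not Butler's own procedure, and the distance measure, the function names, and the four numbered items with their links are assumptions made for illustration.

```python
from itertools import permutations


def spread(order, links):
    """Total distance of the marked squares from the diagonal.

    `links` holds pairs of item numbers whose common square is marked,
    whether as an association or as a discrimination.
    """
    position = {item: i for i, item in enumerate(order)}
    return sum(abs(position[a] - position[b]) for a, b in links)


def best_order(items, links):
    """Try every ordering and keep the one with the smallest spread.

    Exhaustive search is workable only for a handful of items.
    """
    return min(permutations(items), key=lambda order: spread(order, links))


# Invented example: four numbered items and three marked squares.
items = [1, 2, 3, 4]
links = [(1, 3), (3, 2), (2, 4)]

print(spread(items, links))      # spread of the original ordering
print(best_order(items, links))  # ordering with marks nearest the diagonal
```

With more than a few items the number of orderings grows too quickly for exhaustive search, which is consistent with the text's description of the task as one of repeated shuffling toward a good arrangement.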

Validating the sequence also is accomplished with
the criterion test which has been validated and revised.
The test is given to a group of trained individuals, i.e.,
as a post-test to persons who just completed the program, or
to those who have been on the job about six months. In the
analysis of these scores, one looks for the dependency and
interdependency between and among units, lessons, or fairly
large blocks of curricular content.

Butler indicated that the test data should be analyzed
with two basic questions in mind: (1) Did the majority of
those students who correctly performed a subordinate unit
also correctly perform the following and supposedly dependent
unit? and (2) Did the majority of those who correctly
performed the higher unit also perform the subordinate unit
correctly? If, for a tested trained sample, the answers
to both questions are affirmative, then the sequence is
valid. If, for only 85% of the sample, the answers are
affirmative, then the sequence is probably valid. The
following chart provides a summary for analyzing criterion
test data from a sample trained population.



[Table 3: only fragments are legible. Column headings: Trained Sample Performance; Implications. Legible entries include "fail to perform sub unit", "Possible incorrect sequence", and "Performs sub unit (100%)".]

Table 3. Validating Content Sequence. (From Butler, 1972,
p. 125)
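Butler's two questions about a pair of units can be computed directly from pass-fail data. The Python sketch below is a minimal illustration. The function name, the returned keys, the reading of "majority" as more than half, and the sample data are assumptions and are not taken from Butler's chart.

```python
def sequence_check(sub_pass, higher_pass):
    """Answer the two sequence questions for one pair of units.

    Both arguments are parallel 0/1 lists with one entry per trainee.
    """
    pairs = list(zip(sub_pass, higher_pass))
    # Question 1: of those who passed the subordinate unit,
    # what share also passed the higher unit?
    after_sub = [h for s, h in pairs if s]
    # Question 2: of those who passed the higher unit,
    # what share also passed the subordinate unit?
    after_high = [s for s, h in pairs if h]
    q1 = sum(after_sub) / len(after_sub) if after_sub else 0.0
    q2 = sum(after_high) / len(after_high) if after_high else 0.0
    return {"q1": q1, "q2": q2, "both_affirmative": q1 > 0.5 and q2 > 0.5}


# Invented data for ten trained individuals.
sub = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
high = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(sequence_check(sub, high))
```

A low value on the second question is the more telling signal: if many trainees pass the higher unit without having passed the subordinate one, the supposed dependency between the units is doubtful.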

The foregoing procedure is used on a pair of
tasks in a hierarchy. Suppose the hierarchy consisted of
three or more tasks and validation is still required.
Recent research has gone in the direction of trying to
discover such hierarchies and their properties, and
validation procedures are under study, using factor
analysis techniques. The reader may wish to refer to
"A Method for Validating Sequential Instructional
Hierarchies," by P. W. Airasian, in the December, 1971
issue of Educational Technology. Airasian's method is
based on calculation of conditional item difficulty indices
and facilitates the pinpointing of sequential levels
within a hierarchy which require revision.

Develop learning strategies. There are no feasible
validation procedures for developing learning strategies
which are not costly and time consuming to use. Media
are selected according to those that will do an effective
job for the least cost. Combinations of the different
media usually should be considered.

Validation is influenced by the media. Test scores
may be low for students with reading problems, but the
same test scores may be improved by using audio media
instead of printed media. The objectives and student
learning styles are the prime determinants in developing
the learning strategies.

Develop instructional lessons. This is the point
where a test model of the instructional system is produced.
Two documents are needed: (1) the system development plan,
and (2) the instructor's manual or guide.

The system development plan contains: (1) task
analysis summary forms; (2) validated objectives in validated
sequence, supported by a summary of the validation data;
(3) validated criterion test items in validated sequence,
supported by a summary of the validation data; (4) outline
of instructional strategies with associated content
(objectives) identified; and (5) production and testing plans for
the system.

The design and format of the individual learning units
may vary greatly, but each should contain the following:
(1) the performance objectives; (2) the knowledges and skills
to be gained; (3) a list of tools, equipment, supplies,
references, etc., needed for the unit; (4) a learning activity
guide; (5) interim progress checks and student
self-evaluations; and (6) an instrument to serve as a pre-test and/or
a post-test for evaluations by the instructor.

Validate individual lessons. At this point, each unit
is tested and revised until 85% of sample trainees reach the
criterion.

Revision may require resequencing and adoption of
new learning strategies. Initial testing is done on an
individual or one-to-one basis, with two or three sample
trainees who have upper-level ability. Minor revisions may
be made at this point; however, if major revision is
indicated, two or three more individual tryouts should be
conducted.

Small-group tryout is then conducted on 6 to 10
students who represent the range of ability and background
of the target population. Criterion test data are again used
to locate trouble spots and revision is made. At this point,
85% of the students should be performing correctly on the
criterion test.

Final tryout is made on a large group of 30 to 50
students under conditions which approximate actual training.
This tryout is conducted by the curriculum designer along
with the instructor. A group this size is needed to verify
or validate previous design results. Final revision is
made following this tryout.
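The tryout stages and the 85 percent standard can be summarized in a small Python sketch. The stage labels, function names, and sample results are illustrative assumptions; the group sizes and the 85 percent figure come from the text above.

```python
TARGET_PCT = 85  # share of sample trainees who should reach criterion

# Tryout group sizes as given in the text: (minimum, maximum).
STAGES = {
    "individual": (2, 3),
    "small group": (6, 10),
    "final": (30, 50),
}


def sample_size_ok(stage, n_trainees):
    """True when the sample size falls within the range given for the stage."""
    low, high = STAGES[stage]
    return low <= n_trainees <= high


def meets_target(reached):
    """True when at least 85 percent of the trainees reached criterion.

    `reached` is a list of booleans, one per trainee. Integer arithmetic
    avoids rounding trouble at exactly 85 percent.
    """
    return 100 * sum(reached) >= TARGET_PCT * len(reached)


# Invented small-group tryout: 8 trainees, 7 of whom reached criterion.
results = [True] * 7 + [False]
print(sample_size_ok("small group", len(results)))
print(meets_target(results))
```

With samples this small a single trainee moves the percentage considerably, so a unit that narrowly passes at the small-group stage still needs the larger final tryout for confirmation.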

Implement and field test system. This is done under
actual classroom conditions. The instructor's role in the
instructional system is explicated at this point, and an
instructor's manual is developed. The teacher becomes a
manager and facilitator of learning and his tasks are as
follows: (1) diagnose individual learning needs; (2)
prescribe learning experiences; (3) provide proper materials
and equipment at the right time; (4) test and evaluate individual
progress; (5) compile individual and group progress records;
(6) provide tutorial and counseling help; (7) provide
motivational reinforcement; (8) provide supplementary
materials and experiences; (9) coordinate individual,
small-group, and large-group learning activities; (10) coordinate
use of learning materials and equipment; and (11) evaluate
feedback data on effectiveness of learning.

The instructor's manual should contain: (1) course 
description; (2) student population description; (3) per- 
formance objectives; (4) criterion tests; (5) system per- 
formance data; and (6) suggestions for administering the 
system. 

Field testing is the final phase of the systems 
development process. This means the program is monitored, 
evaluated, and subsequently revised continuously for as 
long as it is in use. This phase may be more appropriately
referred to as system "institutionalization." Constant
monitoring and analysis of criterion test data will continue
to point the way for needed revision. 

Butler pointed out that a training system is never a 
finished product but rather it is constantly in process. 

Follow-up on graduates. Effective guidance and
placement are important in a systems approach.
Longitudinal planning for follow-up at 1-year, 3-year, 5-year, or
10-year intervals should be started. Follow-up to obtain
details of occupational patterns, changes in needed
competencies, job adjustment problems, and work
satisfaction indices can all be used as feedback to improve
the instructional system.



Chapter 3 
SUMMARY AND CONCLUSIONS 




CRM, in general, is the assessment of an individual's
performance based on the degree to which his or her
behavioral responses resemble the desired performance or
criterion at a specified level. The individual's score
is directly interpretable in terms of a specified universe
of content and instructionally relevant tasks.

Both NRM and CRM help in making basic decisions
concerning individuals and programs. However, the score
interpretation is different in these two measures. A normative
score indicates how well an individual performed on a
measure in relationship to others on the same measure. A
criterion score is directly interpretable as to what an
individual can or cannot do in relationship to a specified
universe of content.

The major differences between the two measures lie in
the purpose of the test, the manner in which it is
constructed, the specificity of the information obtained about the
domain of relevant tasks, the generalization of the test
performance, and the use of the score.

In determining which type of measurement to use, if
there is the need for selectivity or competitive comparison
among individuals, NRM should be used. CRM should be used
to determine whether a person has mastered certain knowledges,
understandings, and skills. CRM can also be used with any
type of programmed learning or individualized instruction,
and for promotion and licensing procedures.

When writing a CRM, the test constructor must make
sure that the test items accurately sample the range of
criterion behaviors being measured. Criterion relevance,
deficiency, and contamination should be analyzed. The items
must possess congruency with the universe of instructionally
relevant tasks.

The first step in evaluating training outcomes is to
define precisely what is to be measured. This is accomplished
by writing behavioral or performance objectives for all
desired outcomes. These behavioral objectives must be
translated into specific test tasks which form the basis for
inference that the behaviors have been acquired by the
individual.

The most important requirement when writing a CRM is
that an objective, systematic procedure be used to specify
the domain of tasks required to be performed. One such method
is through the use of an item form, which consists of a general
form and generation rules which specifically define the
required tasks. The item form can be used to generate many
different items with a fixed syntactical structure. Thus, a
collection of item forms defines the universe of content for the
test.
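An item form can be pictured as a fixed sentence frame plus a rule for filling it. The Python sketch below builds on the carpentry hand-tool objective mentioned earlier in the paper. The frame wording, the tool list, and the function names are hypothetical and serve only to show how one form yields many items of identical structure.

```python
import random

# Hypothetical item form: a general form (fixed frame) and a replacement set.
ITEM_FORM = {
    "general_form": "Which of these tools is the {tool}?",
    "tools": ["claw hammer", "crosscut saw", "block plane", "wood chisel"],
}


def generate_item(form, rng):
    """Generation rule: draw the target tool from the replacement set and
    present every tool in the set, in random order, as the options."""
    target = rng.choice(form["tools"])
    options = list(form["tools"])
    rng.shuffle(options)
    return {
        "stem": form["general_form"].format(tool=target),
        "options": options,
        "key": target,
    }


def universe_of_stems(form):
    """Every stem the form can generate; the union over all item forms
    would make up the universe of content for the test."""
    return [form["general_form"].format(tool=t) for t in form["tools"]]


item = generate_item(ITEM_FORM, random.Random(0))
print(item["stem"])
print(universe_of_stems(ITEM_FORM))
```

Because every generated item comes from a stated rule, a test built by sampling from the form supports a direct statement about the proportion of the whole domain a trainee can handle.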



The major concern of CRM experts is the need for
evaluating how "good" a criterion-referenced test is. While
there are many textbooks and articles written about the
well-honed mental test theory procedures (norm-referenced
tests), there are very few guides available on
criterion-referenced tests. Since most of the traditional theories
and formulas for determining the adequacy of a NRM are based
on variance, they cannot be applied to criterion-referenced
measures. Variability is irrelevant with CRM because the
meaning of the scores flows directly from the connection
between the items and the criterion.

Several variations of traditional test theory have
been suggested for evaluating the adequacy of a criterion
measure. Content validity is the main method to evaluate
whether the test measures what it purports to measure. Equivalent
forms and sequentially scaled tests have been proposed to be
used to estimate the consistency or reliability of the test.
A pretest-posttest discrimination index could be used to
evaluate a test item, and the traditional upper group minus
lower group index could be used with limitations.

Conclusions 

The literature would appear to support the following 
conclusions: 

1. Although experts do not agree on a single
definition of criterion-referenced measurement, all variations
have in common an emphasis on the interpretation of the
individual's score, which represents what an individual can
do relative to the instructional objectives of a program.

2. Criterion-referenced information is valuable in
making instructional decisions based on what a person can do
at a certain time in the training cycle. If training is
going to become more adaptive to the individual, this input
is a necessity.

3. CRM has focused much attention on behavioral
objectives and desired trainee outcomes. Detailed
specifications of test construction processes and experimental
evidence relating behavior to test performance appear to be
a promising approach to the measurement of competencies in
training.

4. Behavioral objectives must be carefully written
in order to more validly direct the instructional design and
measure its effectiveness.

5. More than one method should be used to validate 
any desired CRM in order to decrease the error that is 
associated with its measurement. 

6. It is difficult to develop objective procedures
necessary for CRM of complex behavior. For complex behavioral
domains, until explicit models stated in measurable terms are
developed, there is too much subjectivity in
this type of test construction.

7. CRM supplements but should not replace normative
tests in training. Both are essential for making decisions
about the training process. The more simply, clearly, and directly
test results can be presented, the more useful and
instructionally fruitful tests are likely to be.

8. CRM seems interesting and relevant for today's
training systems, but there is need for research, both
theoretical and empirical, before extensive use of it can
be recommended in an instructional environment.